first cell of your notebook you must include your names, project title, and a hyperlink to your webpage at github.io; the webpage must be publicly readable on the internet (i.e., live) and must contain the same work that is in the submitted notebook. That is, the first cell of your notebook must be a markdown cell with a hyperlink to the generated webpage at yourname.github.io.
(3 Points) Slide: Your slide includes your name, a link to your website, the datasets you hope to use, and a question you are currently considering.
(3 Points) Pitch: Your pitch takes no more than 2-3 minutes and is coherent, and you are an active, in-person participant in class on the day of the pitch.
The project primarily investigates data on the health factors of each county in the USA. Health factors here include health behaviors, clinical care, socio-economic factors, the physical environment, and other health outcomes. Using the available data along with additional public datasets, I plan to find out which variables are most responsible for health outcomes. I expect that metrics such as correlations can help tell these variables apart. Using those variables, I plan to create a model and possibly test it with new data sources.
I plan to first find more datasets that I can relate this dataset to, and thus have more explanatory measures that could influence the health outcomes. Demographics, education quality, or the presence or absence of certain institutions might shed more light on the health results. GitHub will primarily be used to store all the data and notebooks.
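The plan above names correlations as one possible metric. A minimal sketch of that idea, using synthetic data only: the column names `factor_a`, `factor_b`, `factor_c`, and `outcome` are hypothetical, and the relationship between them is built in by construction, so the printed ranking says nothing about the real county data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; every column here is hypothetical.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "factor_a": rng.normal(size=n),
    "factor_b": rng.normal(size=n),
    "factor_c": rng.normal(size=n),
})
# By construction, the outcome depends strongly on factor_a,
# weakly on factor_b, and not at all on factor_c.
toy["outcome"] = (
    3.0 * toy["factor_a"]
    + 0.5 * toy["factor_b"]
    + rng.normal(scale=0.5, size=n)
)

# Pearson correlation of each factor with the outcome,
# ranked by absolute strength.
corr_with_outcome = toy.corr()["outcome"].drop("outcome")
ranking = corr_with_outcome.abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear association and does not show that a factor causes an outcome, so a ranking like this is a way to screen variables rather than a conclusion.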
pip install missingno
Collecting missingno
Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Requirement already satisfied: numpy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.24.4)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.11/site-packages (from missingno) (3.7.2)
Requirement already satisfied: scipy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.11.2)
Requirement already satisfied: seaborn in /opt/conda/lib/python3.11/site-packages (from missingno) (0.12.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (4.42.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: pandas>=0.25 in /opt/conda/lib/python3.11/site-packages (from seaborn->missingno) (2.0.3)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0)
Installing collected packages: missingno
Successfully installed missingno-0.5.2
Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# import pycountry_convert as pc
import missingno as mno
import warnings
URL = "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2023_0.csv"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
with warnings.catch_warnings():
    # Suppress warnings emitted while reading the CSV
    warnings.simplefilter('ignore')
    df = pd.read_csv(URL, storage_options=headers)
# Alternative: read a local copy of the same file
# with warnings.catch_warnings():
#     warnings.simplefilter('ignore')
#     df = pd.read_csv("data/analytic_data2023_0.csv")
df.head()
| | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Release Year | County Ranked (Yes=1/No=0) | Premature Death raw value | Premature Death numerator | Premature Death denominator | ... | % Female raw value | % Female numerator | % Female denominator | % Female CI low | % Female CI high | % Rural raw value | % Rural numerator | % Rural denominator | % Rural CI low | % Rural CI high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statecode | countycode | fipscode | state | county | year | county_ranked | v001_rawvalue | v001_numerator | v001_denominator | ... | v057_rawvalue | v057_numerator | v057_denominator | v057_cilow | v057_cihigh | v058_rawvalue | v058_numerator | v058_denominator | v058_cilow | v058_cihigh |
| 1 | 00 | 000 | 00000 | US | United States | 2023 | NaN | 7281.9355638 | 4125218 | 917267406 | ... | 0.5047067187 | 167509003 | 331893745 | NaN | NaN | 0.193 | NaN | NaN | NaN | NaN |
| 2 | 01 | 000 | 01000 | AL | Alabama | 2023 | NaN | 10350.071456 | 88086 | 13668498 | ... | 0.5142542169 | 2591778 | 5039877 | NaN | NaN | 0.409631829 | 1957932 | 4779736 | NaN | NaN |
| 3 | 01 | 001 | 01001 | AL | Autauga County | 2023 | 1 | 8027.3947267 | 836 | 156081 | ... | 0.513782892 | 30362 | 59095 | NaN | NaN | 0.4200216232 | 22921 | 54571 | NaN | NaN |
| 4 | 01 | 003 | 01003 | AL | Baldwin County | 2023 | 1 | 8118.3582061 | 3377 | 614143 | ... | 0.5134771453 | 122872 | 239294 | NaN | NaN | 0.4227909911 | 77060 | 182265 | NaN | NaN |
5 rows × 720 columns
df.shape
(3195, 720)
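In the preview above, the first data row holds short variable codes (`statecode`, `v001_rawvalue`, ...) rather than values. One option, shown here as a minimal sketch on a tiny in-memory sample rather than the real file, is to skip that row at read time and read the FIPS columns as strings so that leading zeros survive. The `sample_csv` text and the `fips_cols` mapping are assumptions made for illustration; the notebook itself keeps the row for now.

```python
import io
import pandas as pd

# Tiny sample mimicking the file layout: a descriptive header row,
# then a row of short variable codes, then the data rows.
sample_csv = io.StringIO(
    "State FIPS Code,County FIPS Code,Name,Premature Death raw value\n"
    "statecode,countycode,county,v001_rawvalue\n"
    "01,001,Autauga County,8027.3947267\n"
    "01,003,Baldwin County,8118.3582061\n"
)

# skiprows=[1] skips the variable-code row (0-based file line 1).
# Reading the FIPS columns as str keeps codes such as "01" and "001" intact.
fips_cols = {"State FIPS Code": str, "County FIPS Code": str}
sample = pd.read_csv(sample_csv, skiprows=[1], dtype=fips_cols)

print(sample.dtypes)
print(sample)
```

With the code row gone, pandas can infer a numeric dtype for the value column directly, instead of treating the whole column as text.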
# Display all the columns
for col in df.columns:
    print(col)
State FIPS Code County FIPS Code 5-digit FIPS Code State Abbreviation Name Release Year County Ranked (Yes=1/No=0) Premature Death raw value Premature Death numerator Premature Death denominator Premature Death CI low Premature Death CI high Premature Death flag (0 = No Flag/1=Unreliable/2=Suppressed) Premature Death (AIAN) Premature Death CI low (AIAN) Premature Death CI high (AIAN) Premature Death flag (AIAN) (. = No Flag/1=Unreliable/2=Suppressed) Premature Death (Asian/Pacific Islander) Premature Death CI low (Asian/Pacific Islander) Premature Death CI high (Asian/Pacific Islander) Premature Death flag (Asian/Pacific Islander) (. = No Flag/1=Unreliable/2=Suppressed) Premature Death (Black) Premature Death CI low (Black) Premature Death CI high (Black) Premature Death flag (Black) (. = No Flag/1=Unreliable/2=Suppressed) Premature Death (Hispanic) Premature Death CI low (Hispanic) Premature Death CI high (Hispanic) Premature Death flag (Hispanic) (. = No Flag/1=Unreliable/2=Suppressed) Premature Death (White) Premature Death CI low (White) Premature Death CI high (White) Premature Death flag (White) (. 
= No Flag/1=Unreliable/2=Suppressed) Poor or Fair Health raw value Poor or Fair Health numerator Poor or Fair Health denominator Poor or Fair Health CI low Poor or Fair Health CI high Poor Physical Health Days raw value Poor Physical Health Days numerator Poor Physical Health Days denominator Poor Physical Health Days CI low Poor Physical Health Days CI high Poor Mental Health Days raw value Poor Mental Health Days numerator Poor Mental Health Days denominator Poor Mental Health Days CI low Poor Mental Health Days CI high Low Birthweight raw value Low Birthweight numerator Low Birthweight denominator Low Birthweight CI low Low Birthweight CI high LBW unreliable indicator (Unreliable = Numerator < 20 or relative standard error > 20%) Low Birthweight (AIAN) Low Birthweight CI low (AIAN) Low Birthweight CI high (AIAN) Low Birthweight (Asian/Pacific Islander) Low Birthweight CI low (Asian/Pacific Islander) Low Birthweight CI high (Asian/Pacific Islander) Low Birthweight (Black) Low Birthweight CI low (Black) Low Birthweight CI high (Black) Low Birthweight (Hispanic) Low Birthweight CI low (Hispanic) Low Birthweight CI high (Hispanic) Low Birthweight (White) Low Birthweight CI low (White) Low Birthweight CI high (White) Adult Smoking raw value Adult Smoking numerator Adult Smoking denominator Adult Smoking CI low Adult Smoking CI high Adult Obesity raw value Adult Obesity numerator Adult Obesity denominator Adult Obesity CI low Adult Obesity CI high Food Environment Index raw value Food Environment Index numerator Food Environment Index denominator Food Environment Index CI low Food Environment Index CI high Physical Inactivity raw value Physical Inactivity numerator Physical Inactivity denominator Physical Inactivity CI low Physical Inactivity CI high Access to Exercise Opportunities raw value Access to Exercise Opportunities numerator Access to Exercise Opportunities denominator Access to Exercise Opportunities CI low Access to Exercise Opportunities CI high Excessive 
Drinking raw value Excessive Drinking numerator Excessive Drinking denominator Excessive Drinking CI low Excessive Drinking CI high Alcohol-Impaired Driving Deaths raw value Alcohol-Impaired Driving Deaths numerator Alcohol-Impaired Driving Deaths denominator Alcohol-Impaired Driving Deaths CI low Alcohol-Impaired Driving Deaths CI high Sexually Transmitted Infections raw value Sexually Transmitted Infections numerator Sexually Transmitted Infections denominator Sexually Transmitted Infections CI low Sexually Transmitted Infections CI high Teen Births raw value Teen Births numerator Teen Births denominator Teen Births CI low Teen Births CI high Teen Births (AIAN) Teen Births CI low (AIAN) Teen Births CI high (AIAN) Teen Births (Asian/Pacific Islander) Teen Births CI low (Asian/Pacific Islander) Teen Births CI high (Asian/Pacific Islander) Teen Births (Black) Teen Births CI low (Black) Teen Births CI high (Black) Teen Births (Hispanic) Teen Births CI low (Hispanic) Teen Births CI high (Hispanic) Teen Births (White) Teen Births CI low (White) Teen Births CI high (White) Uninsured raw value Uninsured numerator Uninsured denominator Uninsured CI low Uninsured CI high Primary Care Physicians raw value Primary Care Physicians numerator Primary Care Physicians denominator Primary Care Physicians CI low Primary Care Physicians CI high Ratio of population to primary care physicians. Dentists raw value Dentists numerator Dentists denominator Dentists CI low Dentists CI high Ratio of population to dentists. Mental Health Providers raw value Mental Health Providers numerator Mental Health Providers denominator Mental Health Providers CI low Mental Health Providers CI high Ratio of population to mental health providers. 
Preventable Hospital Stays raw value Preventable Hospital Stays numerator Preventable Hospital Stays denominator Preventable Hospital Stays CI low Preventable Hospital Stays CI high Preventable Hospital Stays (AIAN) Preventable Hospital Stays (Asian/Pacific Islander) Preventable Hospital Stays (Black) Preventable Hospital Stays (Hispanic) Preventable Hospital Stays (White) Mammography Screening raw value Mammography Screening numerator Mammography Screening denominator Mammography Screening CI low Mammography Screening CI high Mammography Screening (AIAN) Mammography Screening (Asian/Pacific Islander) Mammography Screening (Black) Mammography Screening (Hispanic) Mammography Screening (White) Flu Vaccinations raw value Flu Vaccinations numerator Flu Vaccinations denominator Flu Vaccinations CI low Flu Vaccinations CI high Flu Vaccinations (AIAN) Flu Vaccinations (Asian/Pacific Islander) Flu Vaccinations (Black) Flu Vaccinations (Hispanic) Flu Vaccinations (White) High School Completion raw value High School Completion numerator High School Completion denominator High School Completion CI low High School Completion CI high Some College raw value Some College numerator Some College denominator Some College CI low Some College CI high Unemployment raw value Unemployment numerator Unemployment denominator Unemployment CI low Unemployment CI high Children in Poverty raw value Children in Poverty numerator Children in Poverty denominator Children in Poverty CI low Children in Poverty CI high Children in Poverty (AIAN) Children in Poverty CI low (AIAN) Children in Poverty CI high (AIAN) Children in Poverty (Asian/Pacific Islander) Children in Poverty CI low (Asian/Pacific Islander) Children in Poverty CI high (Asian/Pacific Islander) Children in Poverty (Black) Children in Poverty CI low (Black) Children in Poverty CI high (Black) Children in Poverty (Hispanic) Children in Poverty CI low (Hispanic) Children in Poverty CI high (Hispanic) Children in Poverty (White) 
Children in Poverty CI low (White) Children in Poverty CI high (White) Income Inequality raw value Income Inequality numerator Income Inequality denominator Income Inequality CI low Income Inequality CI high Children in Single-Parent Households raw value Children in Single-Parent Households numerator Children in Single-Parent Households denominator Children in Single-Parent Households CI low Children in Single-Parent Households CI high Social Associations raw value Social Associations numerator Social Associations denominator Social Associations CI low Social Associations CI high Injury Deaths raw value Injury Deaths numerator Injury Deaths denominator Injury Deaths CI low Injury Deaths CI high Injury Deaths (AIAN) Injury Deaths CI low (AIAN) Injury Deaths CI high (AIAN) Injury Deaths (Asian/Pacific Islander) Injury Deaths CI low (Asian/Pacific Islander) Injury Deaths CI high (Asian/Pacific Islander) Injury Deaths (Black) Injury Deaths CI low (Black) Injury Deaths CI high (Black) Injury Deaths (Hispanic) Injury Deaths CI low (Hispanic) Injury Deaths CI high (Hispanic) Injury Deaths (White) Injury Deaths CI low (White) Injury Deaths CI high (White) Air Pollution - Particulate Matter raw value Air Pollution - Particulate Matter numerator Air Pollution - Particulate Matter denominator Air Pollution - Particulate Matter CI low Air Pollution - Particulate Matter CI high Drinking Water Violations raw value Drinking Water Violations numerator Drinking Water Violations denominator Drinking Water Violations CI low Drinking Water Violations CI high Severe Housing Problems raw value Severe Housing Problems numerator Severe Housing Problems denominator Severe Housing Problems CI low Severe Housing Problems CI high Percentage of households with high housing costs Percentage of households with high housing costs CI low Percentage of households with high housing costs CI high Percentage of households with overcrowding Percentage of households with overcrowding CI low Percentage 
of households with overcrowding CI high Percentage of households with lack of kitchen or plumbing facilities Percentage of households with lack of kitchen or plumbing facilities CI low Percentage of households with lack of kitchen or plumbing facilities CI high Driving Alone to Work raw value Driving Alone to Work numerator Driving Alone to Work denominator Driving Alone to Work CI low Driving Alone to Work CI high Driving Alone to Work (AIAN) Driving Alone to Work CI low (AIAN) Driving Alone to Work CI high (AIAN) Driving Alone to Work (Asian/Pacific Islander) Driving Alone to Work CI low (Asian/Pacific Islander) Driving Alone to Work CI high (Asian/Pacific Islander) Driving Alone to Work (Black) Driving Alone to Work CI low (Black) Driving Alone to Work CI high (Black) Driving Alone to Work (Hispanic) Driving Alone to Work CI low (Hispanic) Driving Alone to Work CI high (Hispanic) Driving Alone to Work (White) Driving Alone to Work CI low (White) Driving Alone to Work CI high (White) Long Commute - Driving Alone raw value Long Commute - Driving Alone numerator Long Commute - Driving Alone denominator Long Commute - Driving Alone CI low Long Commute - Driving Alone CI high Life Expectancy raw value Life Expectancy numerator Life Expectancy denominator Life Expectancy CI low Life Expectancy CI high Life Expectancy (AIAN) Life Expectancy CI low (AIAN) Life Expectancy CI high (AIAN) Life Expectancy (Asian/Pacific Islander) Life Expectancy CI low (Asian/Pacific Islander) Life Expectancy CI high (Asian/Pacific Islander) Life Expectancy (Black) Life Expectancy CI low (Black) Life Expectancy CI high (Black) Life Expectancy (Hispanic) Life Expectancy CI low (Hispanic) Life Expectancy CI high (Hispanic) Life Expectancy (White) Life Expectancy CI low (White) Life Expectancy CI high (White) Premature Age-Adjusted Mortality raw value Premature Age-Adjusted Mortality numerator Premature Age-Adjusted Mortality denominator Premature Age-Adjusted Mortality CI low Premature 
Age-Adjusted Mortality CI high Premature Age-Adjusted Mortality (AIAN) Premature Age-Adjusted Mortality CI low (AIAN) Premature Age-Adjusted Mortality CI high (AIAN) Premature Age-Adjusted Mortality (Asian/Pacific Islander) Premature Age-Adjusted Mortality CI low (Asian/Pacific Islander) Premature Age-Adjusted Mortality CI high (Asian/Pacific Islander) Premature Age-Adjusted Mortality (Black) Premature Age-Adjusted Mortality CI low (Black) Premature Age-Adjusted Mortality CI high (Black) Premature Age-Adjusted Mortality (Hispanic) Premature Age-Adjusted Mortality CI low (Hispanic) Premature Age-Adjusted Mortality CI high (Hispanic) Premature Age-Adjusted Mortality (White) Premature Age-Adjusted Mortality CI low (White) Premature Age-Adjusted Mortality CI high (White) Child Mortality raw value Child Mortality numerator Child Mortality denominator Child Mortality CI low Child Mortality CI high Child Mortality (AIAN) Child Mortality CI low (AIAN) Child Mortality CI high (AIAN) Child Mortality (Asian/Pacific Islander) Child Mortality CI low (Asian/Pacific Islander) Child Mortality CI high (Asian/Pacific Islander) Child Mortality (Black) Child Mortality CI low (Black) Child Mortality CI high (Black) Child Mortality (Hispanic) Child Mortality CI low (Hispanic) Child Mortality CI high (Hispanic) Child Mortality (White) Child Mortality CI low (White) Child Mortality CI high (White) Infant Mortality raw value Infant Mortality numerator Infant Mortality denominator Infant Mortality CI low Infant Mortality CI high Infant Mortality (AIAN) Infant Mortality CI low (AIAN) Infant Mortality CI high (AIAN) Infant Mortality (Asian/Pacific Islander) Infant Mortality CI low (Asian/Pacific Islander) Infant Mortality CI high (Asian/Pacific Islander) Infant Mortality (Black) Infant Mortality CI low (Black) Infant Mortality CI high (Black) Infant Mortality (Hispanic) Infant Mortality CI low (Hispanic) Infant Mortality CI high (Hispanic) Infant Mortality (White) Infant Mortality CI low 
(White) Infant Mortality CI high (White) Frequent Physical Distress raw value Frequent Physical Distress numerator Frequent Physical Distress denominator Frequent Physical Distress CI low Frequent Physical Distress CI high Frequent Mental Distress raw value Frequent Mental Distress numerator Frequent Mental Distress denominator Frequent Mental Distress CI low Frequent Mental Distress CI high Diabetes Prevalence raw value Diabetes Prevalence numerator Diabetes Prevalence denominator Diabetes Prevalence CI low Diabetes Prevalence CI high HIV Prevalence raw value HIV Prevalence numerator HIV Prevalence denominator HIV Prevalence CI low HIV Prevalence CI high Food Insecurity raw value Food Insecurity numerator Food Insecurity denominator Food Insecurity CI low Food Insecurity CI high Limited Access to Healthy Foods raw value Limited Access to Healthy Foods numerator Limited Access to Healthy Foods denominator Limited Access to Healthy Foods CI low Limited Access to Healthy Foods CI high Drug Overdose Deaths raw value Drug Overdose Deaths numerator Drug Overdose Deaths denominator Drug Overdose Deaths CI low Drug Overdose Deaths CI high Drug Overdose Deaths (AIAN) Drug Overdose Deaths CI low (AIAN) Drug Overdose Deaths CI high (AIAN) Drug Overdose Deaths (Asian/Pacific Islander) Drug Overdose Deaths CI low (Asian/Pacific Islander) Drug Overdose Deaths CI high (Asian/Pacific Islander) Drug Overdose Deaths (Black) Drug Overdose Deaths CI low (Black) Drug Overdose Deaths CI high (Black) Drug Overdose Deaths (Hispanic) Drug Overdose Deaths CI low (Hispanic) Drug Overdose Deaths CI high (Hispanic) Drug Overdose Deaths (White) Drug Overdose Deaths CI low (White) Drug Overdose Deaths CI high (White) Insufficient Sleep raw value Insufficient Sleep numerator Insufficient Sleep denominator Insufficient Sleep CI low Insufficient Sleep CI high Uninsured Adults raw value Uninsured Adults numerator Uninsured Adults denominator Uninsured Adults CI low Uninsured Adults CI high 
Uninsured Children raw value Uninsured Children numerator Uninsured Children denominator Uninsured Children CI low Uninsured Children CI high Other Primary Care Providers raw value Other Primary Care Providers numerator Other Primary Care Providers denominator Other Primary Care Providers CI low Other Primary Care Providers CI high Ratio of population to primary care providers other than physicians. High School Graduation raw value High School Graduation numerator High School Graduation denominator High School Graduation CI low High School Graduation CI high Disconnected Youth raw value Disconnected Youth numerator Disconnected Youth denominator Disconnected Youth CI low Disconnected Youth CI high Reading Scores raw value Reading Scores numerator Reading Scores denominator Reading Scores CI low Reading Scores CI high Reading Scores (AIAN) Reading Scores (Asian/Pacific Islander) Reading Scores (Black) Reading Scores (Hispanic) Reading Scores (White) Math Scores raw value Math Scores numerator Math Scores denominator Math Scores CI low Math Scores CI high Math Scores (AIAN) Math Scores (Asian/Pacific Islander) Math Scores (Black) Math Scores (Hispanic) Math Scores (White) School Segregation raw value School Segregation numerator School Segregation denominator School Segregation CI low School Segregation CI high School Funding Adequacy raw value School Funding Adequacy numerator School Funding Adequacy denominator School Funding Adequacy CI low School Funding Adequacy CI high Gender Pay Gap raw value Gender Pay Gap numerator Gender Pay Gap denominator Gender Pay Gap CI low Gender Pay Gap CI high Median Household Income raw value Median Household Income numerator Median Household Income denominator Median Household Income CI low Median Household Income CI high Median Household Income (AIAN) Median Household Income CI low (AIAN) Median Household Income CI high (AIAN) Median household income (Asian) Median household income CI low (Asian) Median household income CI high 
(Asian) Median Household Income (Black) Median Household Income CI low (Black) Median Household Income CI high (Black) Median Household Income (Hispanic) Median Household Income CI low (Hispanic) Median Household Income CI high (Hispanic) Median Household Income (White) Median Household Income CI low (White) Median Household Income CI high (White) Living Wage raw value Living Wage numerator Living Wage denominator Living Wage CI low Living Wage CI high Children Eligible for Free or Reduced Price Lunch raw value Children Eligible for Free or Reduced Price Lunch numerator Children Eligible for Free or Reduced Price Lunch denominator Children Eligible for Free or Reduced Price Lunch CI low Children Eligible for Free or Reduced Price Lunch CI high Residential Segregation - Black/White raw value Residential Segregation - Black/White numerator Residential Segregation - Black/White denominator Residential Segregation - Black/White CI low Residential Segregation - Black/White CI high Child Care Cost Burden raw value Child Care Cost Burden numerator Child Care Cost Burden denominator Child Care Cost Burden CI low Child Care Cost Burden CI high Child Care Centers raw value Child Care Centers numerator Child Care Centers denominator Child Care Centers CI low Child Care Centers CI high Homicides raw value Homicides numerator Homicides denominator Homicides CI low Homicides CI high Homicides (AIAN) Homicides CI low (AIAN) Homicides CI high (AIAN) Homicides (Asian/Pacific Islander) Homicides CI low (Asian/Pacific Islander) Homicides CI high (Asian/Pacific Islander) Homicides (Black) Homicides CI low (Black) Homicides CI high (Black) Homicides (Hispanic) Homicides CI low (Hispanic) Homicides CI high (Hispanic) Homicides (White) Homicides CI low (White) Homicides CI high (White) Suicides raw value Suicides numerator Suicides denominator Suicides CI low Suicides CI high Crude suicide rate Suicides (AIAN) Suicides CI low (AIAN) Suicides CI high (AIAN) Suicides (Asian/Pacific 
Islander) Suicides CI low (Asian/Pacific Islander) Suicides CI high (Asian/Pacific Islander) Suicides (Black) Suicides CI low (Black) Suicides CI high (Black) Suicides (Hispanic) Suicides CI low (Hispanic) Suicides CI high (Hispanic) Suicides (White) Suicides CI low (White) Suicides CI high (White) Firearm Fatalities raw value Firearm Fatalities numerator Firearm Fatalities denominator Firearm Fatalities CI low Firearm Fatalities CI high Firearm Fatalities (AIAN) Firearm Fatalities CI low (AIAN) Firearm Fatalities CI high (AIAN) Firearm Fatalities (Asian/Pacific Islander) Firearm Fatalities CI low (Asian/Pacific Islander) Firearm Fatalities CI high (Asian/Pacific Islander) Firearm Fatalities (Black) Firearm Fatalities CI low (Black) Firearm Fatalities CI high (Black) Firearm Fatalities (Hispanic) Firearm Fatalities CI low (Hispanic) Firearm Fatalities CI high (Hispanic) Firearm Fatalities (White) Firearm Fatalities CI low (White) Firearm Fatalities CI high (White) Motor Vehicle Crash Deaths raw value Motor Vehicle Crash Deaths numerator Motor Vehicle Crash Deaths denominator Motor Vehicle Crash Deaths CI low Motor Vehicle Crash Deaths CI high Motor Vehicle Crash Deaths (AIAN) Motor Vehicle Crash Deaths CI low (AIAN) Motor Vehicle Crash Deaths CI high (AIAN) Motor Vehicle Crash Deaths (Asian/Pacific Islander) Motor Vehicle Crash Deaths CI low (Asian/Pacific Islander) Motor Vehicle Crash Deaths CI high (Asian/Pacific Islander) Motor Vehicle Crash Deaths (Black) Motor Vehicle Crash Deaths CI low (Black) Motor Vehicle Crash Deaths CI high (Black) Motor Vehicle Crash Deaths (Hispanic) Motor Vehicle Crash Deaths CI low (Hispanic) Motor Vehicle Crash Deaths CI high (Hispanic) Motor Vehicle Crash Deaths (White) Motor Vehicle Crash Deaths CI low (White) Motor Vehicle Crash Deaths CI high (White) Juvenile Arrests raw value Juvenile Arrests numerator Juvenile Arrests denominator Juvenile Arrests CI low Juvenile Arrests CI high Number of juvenile delinquency cases formally 
processed by a juvenile court Number of informally handled juvenile delinquency cases Voter Turnout raw value Voter Turnout numerator Voter Turnout denominator Voter Turnout CI low Voter Turnout CI high Census Participation raw value Census Participation numerator Census Participation denominator Census Participation CI low Census Participation CI high Traffic Volume raw value Traffic Volume numerator Traffic Volume denominator Traffic Volume CI low Traffic Volume CI high Homeownership raw value Homeownership numerator Homeownership denominator Homeownership CI low Homeownership CI high Severe Housing Cost Burden raw value Severe Housing Cost Burden numerator Severe Housing Cost Burden denominator Severe Housing Cost Burden CI low Severe Housing Cost Burden CI high Broadband Access raw value Broadband Access numerator Broadband Access denominator Broadband Access CI low Broadband Access CI high Population raw value Population numerator Population denominator Population CI low Population CI high % Below 18 Years of Age raw value % Below 18 Years of Age numerator % Below 18 Years of Age denominator % Below 18 Years of Age CI low % Below 18 Years of Age CI high % 65 and Older raw value % 65 and Older numerator % 65 and Older denominator % 65 and Older CI low % 65 and Older CI high % Non-Hispanic Black raw value % Non-Hispanic Black numerator % Non-Hispanic Black denominator % Non-Hispanic Black CI low % Non-Hispanic Black CI high % American Indian or Alaska Native raw value % American Indian or Alaska Native numerator % American Indian or Alaska Native denominator % American Indian or Alaska Native CI low % American Indian or Alaska Native CI high % Asian raw value % Asian numerator % Asian denominator % Asian CI low % Asian CI high % Native Hawaiian or Other Pacific Islander raw value % Native Hawaiian or Other Pacific Islander numerator % Native Hawaiian or Other Pacific Islander denominator % Native Hawaiian or Other Pacific Islander CI low % Native Hawaiian or 
Other Pacific Islander CI high % Hispanic raw value % Hispanic numerator % Hispanic denominator % Hispanic CI low % Hispanic CI high % Non-Hispanic White raw value % Non-Hispanic White numerator % Non-Hispanic White denominator % Non-Hispanic White CI low % Non-Hispanic White CI high % Not Proficient in English raw value % Not Proficient in English numerator % Not Proficient in English denominator % Not Proficient in English CI low % Not Proficient in English CI high % Female raw value % Female numerator % Female denominator % Female CI low % Female CI high % Rural raw value % Rural numerator % Rural denominator % Rural CI low % Rural CI high
This page describes the idea behind the dataset. This link has all the datasets from different years to download. The dataset has 700+ features to work with, although many columns overlap in meaning and many values are missing.
Primarily, the data columns can be divided into health factors and health outcomes.
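The missing data mentioned above can also be quantified per column as a share of rows. The following is a minimal sketch on a small hypothetical frame (`toy` and its columns `a`, `b`, `c` are made up for illustration), not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isnull() gives a boolean frame; its column mean is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the shares makes it easy to see which columns would fall above any chosen cut-off.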
# Plot how complete each column is
# The longer the bar, the less data is missing
mno.bar(df)
<Axes: >
# Drop every column with more than 1000 missing values
for col in df.columns:
    if df[col].isnull().sum() > 1000:
        df.drop([col], axis=1, inplace=True)
# cols from 720 to 326
df.shape
(3195, 326)
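The same threshold filter can be written without a loop by building a boolean mask over the columns. This is a sketch on a hypothetical toy frame, with `max_missing = 2` standing in for the cut-off of 1000 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are for illustration only.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

max_missing = 2  # stand-in for the notebook's threshold of 1000

# Keep a column only if its missing count does not exceed the threshold.
keep = toy.isnull().sum() <= max_missing
filtered = toy.loc[:, keep]

print(filtered.columns.tolist())
```

Selecting with a mask returns a new frame and leaves the original untouched, which makes it easier to try several thresholds side by side.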
mno.bar(df)
<Axes: >
Many columns repeat the same information. So, we keep only the ones that are enough to represent each measurement.
# We need the raw values only
new_cols = [x for x in df.columns if "raw value" in x]
new_cols = list(df.columns[0:5]) + new_cols
# Replace % by percent
cols = list(map(lambda x:x.replace("%", "percent"), new_cols))
# Remove certain char and substring
cols = list(map(lambda x:x.replace("-", " "), cols))
cols = list(map(lambda x:x.replace(" raw value", ""), cols))
cols = list(map(lambda x:x.replace(" ", "_"), cols))
cols = list(map(lambda x:x.replace(" ", ""), cols))
cols
['State_FIPS_Code', 'County_FIPS_Code', '5_digit_FIPS_Code', 'State_Abbreviation', 'Name', 'Premature_Death', 'Poor_or_Fair_Health', 'Poor_Physical_Health_Days', 'Poor_Mental_Health_Days', 'Low_Birthweight', 'Adult_Smoking', 'Adult_Obesity', 'Food_Environment_Index', 'Physical_Inactivity', 'Access_to_Exercise_Opportunities', 'Excessive_Drinking', 'Alcohol_Impaired_Driving_Deaths', 'Sexually_Transmitted_Infections', 'Teen_Births', 'Uninsured', 'Primary_Care_Physicians', 'Dentists', 'Mental_Health_Providers', 'Preventable_Hospital_Stays', 'Mammography_Screening', 'Flu_Vaccinations', 'High_School_Completion', 'Some_College', 'Unemployment', 'Children_in_Poverty', 'Income_Inequality', 'Children_in_Single_Parent_Households', 'Social_Associations', 'Injury_Deaths', 'Air_Pollution___Particulate_Matter', 'Drinking_Water_Violations', 'Severe_Housing_Problems', 'Driving_Alone_to_Work', 'Long_Commute___Driving_Alone', 'Life_Expectancy', 'Premature_Age_Adjusted_Mortality', 'Frequent_Physical_Distress', 'Frequent_Mental_Distress', 'Diabetes_Prevalence', 'HIV_Prevalence', 'Food_Insecurity', 'Limited_Access_to_Healthy_Foods', 'Insufficient_Sleep', 'Uninsured_Adults', 'Uninsured_Children', 'Other_Primary_Care_Providers', 'High_School_Graduation', 'Reading_Scores', 'Math_Scores', 'School_Segregation', 'School_Funding_Adequacy', 'Gender_Pay_Gap', 'Median_Household_Income', 'Children_Eligible_for_Free_or_Reduced_Price_Lunch', 'Child_Care_Cost_Burden', 'Child_Care_Centers', 'Suicides', 'Firearm_Fatalities', 'Motor_Vehicle_Crash_Deaths', 'Voter_Turnout', 'Census_Participation', 'Traffic_Volume', 'Homeownership', 'Severe_Housing_Cost_Burden', 'Broadband_Access', 'Population', 'percent_Below_18_Years_of_Age', 'percent_65_and_Older', 'percent_Non_Hispanic_Black', 'percent_American_Indian_or_Alaska_Native', 'percent_Asian', 'percent_Native_Hawaiian_or_Other_Pacific_Islander', 'percent_Hispanic', 'percent_Non_Hispanic_White', 'percent_Not_Proficient_in_English', 'percent_Female', 
'percent_Rural']
# Slice the dataframe
df = df[new_cols]
# Rename the columns
df = df.rename(columns=dict(zip(new_cols, cols)))
df.head(2)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statecode | countycode | fipscode | state | county | v001_rawvalue | v002_rawvalue | v036_rawvalue | v042_rawvalue | v037_rawvalue | ... | v053_rawvalue | v054_rawvalue | v055_rawvalue | v081_rawvalue | v080_rawvalue | v056_rawvalue | v126_rawvalue | v059_rawvalue | v057_rawvalue | v058_rawvalue |
| 1 | 00 | 000 | 00000 | US | United States | 7281.9355638 | 0.12 | 3 | 4.4 | 0.0819065527 | ... | 0.1682705801 | 0.1261202919 | 0.0131594526 | 0.0613162595 | 0.0026003593 | 0.1887563262 | 0.5930615866 | 0.0410440385 | 0.5047067187 | 0.193 |
2 rows × 82 columns
# remove the first row
df = df.drop([0])
df = df.reset_index(drop=True)
df.head(2)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00 | 000 | 00000 | US | United States | 7281.9355638 | 0.12 | 3 | 4.4 | 0.0819065527 | ... | 0.1682705801 | 0.1261202919 | 0.0131594526 | 0.0613162595 | 0.0026003593 | 0.1887563262 | 0.5930615866 | 0.0410440385 | 0.5047067187 | 0.193 |
| 1 | 01 | 000 | 01000 | AL | Alabama | 10350.071456 | 0.189 | 3.4824161407 | 5.0732772786 | 0.1043276003 | ... | 0.1763568833 | 0.2651199623 | 0.0071444204 | 0.0155043466 | 0.0010883202 | 0.0478519615 | 0.6487709918 | 0.0102759588 | 0.5142542169 | 0.409631829 |
2 rows × 82 columns
# Checking the states
df["State_Abbreviation"].unique()
array(['US', 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM',
'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
df[df["State_Abbreviation"] =="WY"].head(3)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3170 | 56 | 0 | 56000 | WY | Wyoming | 7809.903503 | 0.115 | 2.698914 | 4.130766 | 0.090792 | ... | 0.179469 | 0.010394 | 0.028395 | 0.010935 | 0.001012 | 0.10554 | 0.833306 | 0.006424 | 0.48823 | 0.35242 |
| 3171 | 56 | 1 | 56001 | WY | Albany County | 5133.53187 | 0.11 | 2.90064 | 4.179786 | 0.085394 | ... | 0.129866 | 0.012949 | 0.013162 | 0.034567 | 0.001409 | 0.101627 | 0.821581 | 0.006262 | 0.47817 | 0.119397 |
| 3172 | 56 | 3 | 56003 | WY | Big Horn County | 9097.45733 | 0.123 | 2.998264 | 3.865339 | 0.069968 | ... | 0.217675 | 0.007479 | 0.018054 | 0.005416 | 0.000516 | 0.096114 | 0.867435 | 0.015205 | 0.491145 | 1.0 |
3 rows × 82 columns
Rows where State_Abbreviation is US hold the national averages, and rows where Name is a state name hold the state averages.
County_FIPS_Code is 0 for these national and state-level rows; only rows with a non-zero County_FIPS_Code are actual counties.
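Because national, state, and county rows share one table, statistics computed over the whole frame mix the three levels. The sketch below shows one way to separate them. It uses a small made-up frame (`toy`) with the notebook's column names; the names `national`, `states`, and `counties` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame that mimics the layout described above (not the real data).
toy = pd.DataFrame({
    "State_Abbreviation": ["US", "WY", "WY", "WY"],
    "County_FIPS_Code": ["000", "000", "001", "003"],
    "Name": ["United States", "Wyoming", "Albany County", "Big Horn County"],
})

# pd.to_numeric handles both the string form ("000") and the integer form (0).
is_aggregate = pd.to_numeric(toy["County_FIPS_Code"]) == 0
is_national = toy["State_Abbreviation"] == "US"

national = toy[is_national]                # the single country-level row
states = toy[is_aggregate & ~is_national]  # one row per state
counties = toy[~is_aggregate]              # actual counties only

print(len(national), len(states), len(counties))
```

Applying the same masks to `df` would make it possible to run county-level analyses without the aggregate rows being counted as extra observations.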
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3194 entries, 0 to 3193 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 State_FIPS_Code 3194 non-null object 1 County_FIPS_Code 3194 non-null object 2 5_digit_FIPS_Code 3194 non-null object 3 State_Abbreviation 3194 non-null object 4 Name 3194 non-null object 5 Premature_Death 3134 non-null object 6 Poor_or_Fair_Health 3192 non-null object 7 Poor_Physical_Health_Days 3192 non-null object 8 Poor_Mental_Health_Days 3192 non-null object 9 Low_Birthweight 3088 non-null object 10 Adult_Smoking 3192 non-null object 11 Adult_Obesity 3192 non-null object 12 Food_Environment_Index 3161 non-null object 13 Physical_Inactivity 3192 non-null object 14 Access_to_Exercise_Opportunities 3132 non-null object 15 Excessive_Drinking 3192 non-null object 16 Alcohol_Impaired_Driving_Deaths 3167 non-null object 17 Sexually_Transmitted_Infections 3071 non-null object 18 Teen_Births 3005 non-null object 19 Uninsured 3193 non-null object 20 Primary_Care_Physicians 3047 non-null object 21 Dentists 3108 non-null object 22 Mental_Health_Providers 2993 non-null object 23 Preventable_Hospital_Stays 3123 non-null object 24 Mammography_Screening 3173 non-null object 25 Flu_Vaccinations 3176 non-null object 26 High_School_Completion 3194 non-null object 27 Some_College 3194 non-null object 28 Unemployment 3193 non-null object 29 Children_in_Poverty 3193 non-null object 30 Income_Inequality 3187 non-null object 31 Children_in_Single_Parent_Households 3193 non-null object 32 Social_Associations 3194 non-null object 33 Injury_Deaths 3089 non-null object 34 Air_Pollution___Particulate_Matter 3167 non-null object 35 Drinking_Water_Violations 3149 non-null object 36 Severe_Housing_Problems 3194 non-null object 37 Driving_Alone_to_Work 3194 non-null object 38 Long_Commute___Driving_Alone 3194 non-null object 39 Life_Expectancy 3124 non-null object 40 Premature_Age_Adjusted_Mortality 3134 non-null object 41 
Frequent_Physical_Distress 3192 non-null object 42 Frequent_Mental_Distress 3192 non-null object 43 Diabetes_Prevalence 3192 non-null object 44 HIV_Prevalence 2735 non-null object 45 Food_Insecurity 3194 non-null object 46 Limited_Access_to_Healthy_Foods 3161 non-null object 47 Insufficient_Sleep 3192 non-null object 48 Uninsured_Adults 3193 non-null object 49 Uninsured_Children 3193 non-null object 50 Other_Primary_Care_Providers 3183 non-null object 51 High_School_Graduation 2362 non-null object 52 Reading_Scores 2826 non-null object 53 Math_Scores 2739 non-null object 54 School_Segregation 2962 non-null object 55 School_Funding_Adequacy 3133 non-null object 56 Gender_Pay_Gap 3187 non-null object 57 Median_Household_Income 3192 non-null object 58 Children_Eligible_for_Free_or_Reduced_Price_Lunch 2606 non-null object 59 Child_Care_Cost_Burden 3192 non-null object 60 Child_Care_Centers 3044 non-null object 61 Suicides 2485 non-null object 62 Firearm_Fatalities 2323 non-null object 63 Motor_Vehicle_Crash_Deaths 2743 non-null object 64 Voter_Turnout 3164 non-null object 65 Census_Participation 3142 non-null object 66 Traffic_Volume 3041 non-null object 67 Homeownership 3194 non-null object 68 Severe_Housing_Cost_Burden 3189 non-null object 69 Broadband_Access 3194 non-null object 70 Population 3194 non-null object 71 percent_Below_18_Years_of_Age 3194 non-null object 72 percent_65_and_Older 3194 non-null object 73 percent_Non_Hispanic_Black 3194 non-null object 74 percent_American_Indian_or_Alaska_Native 3194 non-null object 75 percent_Asian 3194 non-null object 76 percent_Native_Hawaiian_or_Other_Pacific_Islander 3194 non-null object 77 percent_Hispanic 3194 non-null object 78 percent_Non_Hispanic_White 3194 non-null object 79 percent_Not_Proficient_in_English 3194 non-null object 80 percent_Female 3194 non-null object 81 percent_Rural 3187 non-null object dtypes: object(82) memory usage: 2.0+ MB
print(df.head(2).T.to_string())
0 1 State_FIPS_Code 00 01 County_FIPS_Code 000 000 5_digit_FIPS_Code 00000 01000 State_Abbreviation US AL Name United States Alabama Premature_Death 7281.9355638 10350.071456 Poor_or_Fair_Health 0.12 0.189 Poor_Physical_Health_Days 3 3.4824161407 Poor_Mental_Health_Days 4.4 5.0732772786 Low_Birthweight 0.0819065527 0.1043276003 Adult_Smoking 0.16 0.195 Adult_Obesity 0.32 0.393 Food_Environment_Index 7 5.3 Physical_Inactivity 0.22 0.278 Access_to_Exercise_Opportunities 0.8423863046 0.6092667226 Excessive_Drinking 0.19 0.1614162693 Alcohol_Impaired_Driving_Deaths 0.2655507901 0.258869637 Sexually_Transmitted_Infections 481.3 552.2 Teen_Births 19.300572586 27.598889304 Uninsured 0.1044496729 0.1182271569 Primary_Care_Physicians 0.0007637606 0.0006579252 Dentists 0.0007246807 0.0004869166 Mental_Health_Providers 0.0029570126 0.0012541973 Preventable_Hospital_Stays 2809 3599 Mammography_Screening 0.37 0.36 Flu_Vaccinations 0.51 0.44 High_School_Completion 0.8887404032 0.8740270016 Some_College 0.6725325979 0.6150082742 Unemployment 0.0535291312 0.0343902829 Children_in_Poverty 0.169 0.227 Income_Inequality 4.8913749294 5.1766763312 Children_in_Single_Parent_Households 0.2512967212 0.3090921916 Social_Associations 9.1296963648 11.910925297 Injury_Deaths 75.899512272 86.9057184 Air_Pollution___Particulate_Matter 7.4 9.3 Drinking_Water_Violations NaN 0.1343283582 Severe_Housing_Problems 0.1696721824 0.1315678879 Driving_Alone_to_Work 0.732358592 0.8378249329 Long_Commute___Driving_Alone 0.365 0.35 Life_Expectancy 78.528894654 74.83594896 Premature_Age_Adjusted_Mortality 358.7460227 499.86855039 Frequent_Physical_Distress 0.09 0.1107739678 Frequent_Mental_Distress 0.14 0.1648429623 Diabetes_Prevalence 0.09 0.13 HIV_Prevalence 379.7 341.6 Food_Insecurity 0.118 0.145 Limited_Access_to_Healthy_Foods 0.0610019647 0.0876054853 Insufficient_Sleep 0.33 0.3924300962 Uninsured_Adults 0.123766561 0.1491000099 Uninsured_Children 0.0539542665 0.0362680404 Other_Primary_Care_Providers 
0.0012318702 0.0010861376 High_School_Graduation 0.87 0.9071081634 Reading_Scores 3.0534 2.885602535 Math_Scores 3.003 2.72218766 School_Segregation 0.2454 0.2817412656 School_Funding_Adequacy 1062 -3868.511 Gender_Pay_Gap 0.8100444614 0.7418970988 Median_Household_Income 69717 53990 Children_Eligible_for_Free_or_Reduced_Price_Lunch 0.5308547682 0.53338294 Child_Care_Cost_Burden 0.2659357065 0.2722218184 Child_Care_Centers 6.8638668282 5.5092316855 Suicides 13.818282988 16.200669652 Firearm_Fatalities 12.430330228 22.293899524 Motor_Vehicle_Crash_Deaths 11.591311264 20.205514853 Voter_Turnout 0.6790952146 0.6263600041 Census_Participation 0.652 NaN Traffic_Volume 505.31 213.69282656 Homeownership 0.646331101 0.6939478703 Severe_Housing_Cost_Burden 0.1427574897 0.1194424811 Broadband_Access 0.8700069587 0.8204571454 Population 331893745 5039877 percent_Below_18_Years_of_Age 0.2216565817 0.2226744819 percent_65_and_Older 0.1682705801 0.1763568833 percent_Non_Hispanic_Black 0.1261202919 0.2651199623 percent_American_Indian_or_Alaska_Native 0.0131594526 0.0071444204 percent_Asian 0.0613162595 0.0155043466 percent_Native_Hawaiian_or_Other_Pacific_Islander 0.0026003593 0.0010883202 percent_Hispanic 0.1887563262 0.0478519615 percent_Non_Hispanic_White 0.5930615866 0.6487709918 percent_Not_Proficient_in_English 0.0410440385 0.0102759588 percent_Female 0.5047067187 0.5142542169 percent_Rural 0.193 0.409631829
Every column except State_Abbreviation and Name holds numbers stored as strings, so we can convert them to numeric types.
# Fill the NaN with np.nan
df.fillna(np.nan, inplace =True)
# list of cols to convert into float
to_float= [col for col in list(df.columns) if col not in list(df.columns[3:5])]
df[to_float] = df[to_float].apply(pd.to_numeric)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3194 entries, 0 to 3193 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 State_FIPS_Code 3194 non-null int64 1 County_FIPS_Code 3194 non-null int64 2 5_digit_FIPS_Code 3194 non-null int64 3 State_Abbreviation 3194 non-null object 4 Name 3194 non-null object 5 Premature_Death 3134 non-null float64 6 Poor_or_Fair_Health 3192 non-null float64 7 Poor_Physical_Health_Days 3192 non-null float64 8 Poor_Mental_Health_Days 3192 non-null float64 9 Low_Birthweight 3088 non-null float64 10 Adult_Smoking 3192 non-null float64 11 Adult_Obesity 3192 non-null float64 12 Food_Environment_Index 3161 non-null float64 13 Physical_Inactivity 3192 non-null float64 14 Access_to_Exercise_Opportunities 3132 non-null float64 15 Excessive_Drinking 3192 non-null float64 16 Alcohol_Impaired_Driving_Deaths 3167 non-null float64 17 Sexually_Transmitted_Infections 3071 non-null float64 18 Teen_Births 3005 non-null float64 19 Uninsured 3193 non-null float64 20 Primary_Care_Physicians 3047 non-null float64 21 Dentists 3108 non-null float64 22 Mental_Health_Providers 2993 non-null float64 23 Preventable_Hospital_Stays 3123 non-null float64 24 Mammography_Screening 3173 non-null float64 25 Flu_Vaccinations 3176 non-null float64 26 High_School_Completion 3194 non-null float64 27 Some_College 3194 non-null float64 28 Unemployment 3193 non-null float64 29 Children_in_Poverty 3193 non-null float64 30 Income_Inequality 3187 non-null float64 31 Children_in_Single_Parent_Households 3193 non-null float64 32 Social_Associations 3194 non-null float64 33 Injury_Deaths 3089 non-null float64 34 Air_Pollution___Particulate_Matter 3167 non-null float64 35 Drinking_Water_Violations 3149 non-null float64 36 Severe_Housing_Problems 3194 non-null float64 37 Driving_Alone_to_Work 3194 non-null float64 38 Long_Commute___Driving_Alone 3194 non-null float64 39 Life_Expectancy 3124 non-null float64 40 
Premature_Age_Adjusted_Mortality 3134 non-null float64 41 Frequent_Physical_Distress 3192 non-null float64 42 Frequent_Mental_Distress 3192 non-null float64 43 Diabetes_Prevalence 3192 non-null float64 44 HIV_Prevalence 2735 non-null float64 45 Food_Insecurity 3194 non-null float64 46 Limited_Access_to_Healthy_Foods 3161 non-null float64 47 Insufficient_Sleep 3192 non-null float64 48 Uninsured_Adults 3193 non-null float64 49 Uninsured_Children 3193 non-null float64 50 Other_Primary_Care_Providers 3183 non-null float64 51 High_School_Graduation 2362 non-null float64 52 Reading_Scores 2826 non-null float64 53 Math_Scores 2739 non-null float64 54 School_Segregation 2962 non-null float64 55 School_Funding_Adequacy 3133 non-null float64 56 Gender_Pay_Gap 3187 non-null float64 57 Median_Household_Income 3192 non-null float64 58 Children_Eligible_for_Free_or_Reduced_Price_Lunch 2606 non-null float64 59 Child_Care_Cost_Burden 3192 non-null float64 60 Child_Care_Centers 3044 non-null float64 61 Suicides 2485 non-null float64 62 Firearm_Fatalities 2323 non-null float64 63 Motor_Vehicle_Crash_Deaths 2743 non-null float64 64 Voter_Turnout 3164 non-null float64 65 Census_Participation 3142 non-null float64 66 Traffic_Volume 3041 non-null float64 67 Homeownership 3194 non-null float64 68 Severe_Housing_Cost_Burden 3189 non-null float64 69 Broadband_Access 3194 non-null float64 70 Population 3194 non-null int64 71 percent_Below_18_Years_of_Age 3194 non-null float64 72 percent_65_and_Older 3194 non-null float64 73 percent_Non_Hispanic_Black 3194 non-null float64 74 percent_American_Indian_or_Alaska_Native 3194 non-null float64 75 percent_Asian 3194 non-null float64 76 percent_Native_Hawaiian_or_Other_Pacific_Islander 3194 non-null float64 77 percent_Hispanic 3194 non-null float64 78 percent_Non_Hispanic_White 3194 non-null float64 79 percent_Not_Proficient_in_English 3194 non-null float64 80 percent_Female 3194 non-null float64 81 percent_Rural 3187 non-null float64 dtypes: 
float64(76), int64(4), object(2) memory usage: 2.0+ MB
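The output above shows that the FIPS code columns became int64, which drops their leading zeros (for example, "01000" becomes 1000). If the codes are later needed as join keys against other public datasets, one option is to leave them as strings and convert only the measure columns. The sketch below is an alternative under that assumption, not what the notebook does; the frame `toy` and the lists `id_cols` and `measure_cols` are hypothetical, and `errors="coerce"` is an added choice that turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

# Toy frame with string-typed values, as in the raw file (values abbreviated).
toy = pd.DataFrame({
    "State_FIPS_Code": ["00", "01"],
    "5_digit_FIPS_Code": ["00000", "01000"],
    "State_Abbreviation": ["US", "AL"],
    "Name": ["United States", "Alabama"],
    "Adult_Obesity": ["0.32", "0.393"],
    "Census_Participation": ["0.652", None],
})

# Identifier columns stay as strings so leading zeros survive.
id_cols = ["State_FIPS_Code", "5_digit_FIPS_Code", "State_Abbreviation", "Name"]
measure_cols = [c for c in toy.columns if c not in id_cols]

# Convert only the measures; missing or malformed entries become NaN.
toy[measure_cols] = toy[measure_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```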
df.describe()
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | Adult_Smoking | Adult_Obesity | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3194.000000 | 3194.000000 | 3194.000000 | 3134.000000 | 3192.000000 | 3192.000000 | 3192.000000 | 3088.000000 | 3192.000000 | 3192.000000 | ... | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3187.000000 |
| mean | 30.249530 | 101.886662 | 30351.417032 | 8891.562734 | 0.159942 | 3.511726 | 4.794971 | 0.082138 | 0.199762 | 0.361428 | ... | 0.199929 | 0.090869 | 0.024611 | 0.016971 | 0.001625 | 0.102183 | 0.749862 | 0.016072 | 0.495715 | 0.580467 |
| std | 15.160981 | 107.624838 | 15179.045587 | 2929.948857 | 0.044333 | 0.652486 | 0.628114 | 0.020293 | 0.041210 | 0.046825 | ... | 0.047879 | 0.141564 | 0.077649 | 0.030939 | 0.009667 | 0.139670 | 0.202763 | 0.026852 | 0.023189 | 0.315553 |
| min | 0.000000 | 0.000000 | 0.000000 | 3090.426825 | 0.065000 | 1.849017 | 2.779181 | 0.028871 | 0.067000 | 0.176000 | ... | 0.050729 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006827 | 0.026802 | 0.000000 | 0.245614 | 0.000000 |
| 25% | 18.000000 | 33.000000 | 18171.500000 | 6868.647904 | 0.125000 | 3.027309 | 4.373272 | 0.068281 | 0.174000 | 0.336000 | ... | 0.169189 | 0.008182 | 0.004311 | 0.005208 | 0.000377 | 0.026999 | 0.630136 | 0.002579 | 0.490583 | 0.325275 |
| 50% | 29.000000 | 77.000000 | 29174.000000 | 8538.518058 | 0.152000 | 3.448386 | 4.813037 | 0.079532 | 0.198000 | 0.366000 | ... | 0.195519 | 0.024266 | 0.007193 | 0.008099 | 0.000721 | 0.048823 | 0.821402 | 0.007069 | 0.499580 | 0.588250 |
| 75% | 45.000000 | 133.000000 | 45074.500000 | 10494.403953 | 0.189000 | 3.946273 | 5.221064 | 0.091418 | 0.226000 | 0.391000 | ... | 0.225286 | 0.104902 | 0.014716 | 0.015733 | 0.001357 | 0.108941 | 0.915241 | 0.017709 | 0.507127 | 0.861214 |
| max | 56.000000 | 840.000000 | 56045.000000 | 30007.870277 | 0.368000 | 6.335031 | 6.945581 | 0.216981 | 0.411000 | 0.532000 | ... | 0.581710 | 0.856197 | 0.922567 | 0.420553 | 0.475610 | 0.962604 | 0.975921 | 0.384369 | 0.570535 | 1.000000 |
8 rows × 80 columns
Relationship between sleep and obesity in LA (Louisiana) and CA (California)
x = "Adult_Obesity"
y = "Insufficient_Sleep"
z = "State_Abbreviation"
not_null_mask = df[[x,y,z]].notnull().all(axis=1)
not_null_rows = df[[x,y,z]][not_null_mask]
not_null_rows = not_null_rows.query('State_Abbreviation== "LA" or State_Abbreviation== "CA"')
sns.scatterplot(data=not_null_rows, x = x, y = y, hue = z)
<Axes: xlabel='Adult_Obesity', ylabel='Insufficient_Sleep'>
sns.scatterplot(data=df, x = "Broadband_Access", y = "Math_Scores")
<Axes: xlabel='Broadband_Access', ylabel='Math_Scores'>
Splitting the columns into health factors (variables) and health outcomes (targets)
target_cols = ['Premature_Death', 'Life_Expectancy', 'Premature_Age_Adjusted_Mortality',
'Poor_or_Fair_Health','Poor_Physical_Health_Days', 'Poor_Mental_Health_Days','Low_Birthweight',
'Frequent_Physical_Distress','Frequent_Mental_Distress', 'Diabetes_Prevalence', 'HIV_Prevalence']
variable_cols = [x for x in df.columns[5:] if x not in target_cols]
df_corr = df.iloc[:,5:].corr()
df_corr.shape
(77, 77)
df_corr = df_corr[variable_cols]
df_corr = df_corr.loc[target_cols]
df_corr.shape
(11, 66)
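The two slicing steps above keep the target rows and the variable columns of the full correlation matrix. The same rectangular block can be taken in one step with `.loc`, as sketched below on random toy data. The frame `toy` and the lists `toy_targets` and `toy_variables` are stand-ins for `df`, `target_cols`, and `variable_cols`; the values are random and carry no meaning.

```python
import numpy as np
import pandas as pd

# Random toy data standing in for the numeric part of the real frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=["Life_Expectancy", "Adult_Smoking", "Adult_Obesity", "Some_College"],
)

toy_targets = ["Life_Expectancy"]
toy_variables = [c for c in toy.columns if c not in toy_targets]

# Rows = targets, columns = variables, selected in a single .loc call.
block = toy.corr().loc[toy_targets, toy_variables]

print(block.shape)
```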
# Set the figure size before drawing so that it applies to this heatmap
plt.figure(figsize=(10, 15))
sns.heatmap(df_corr.T, annot=True, annot_kws={"fontsize": 7})
plt.xticks(fontsize=8)
plt.yticks(fontsize=9)
Finding the features obesity is most correlated to
obesity_corr = list(df.iloc[:, 5:].corr()[["Adult_Obesity"]].sort_values(by = "Adult_Obesity").index)
obesity_corr = obesity_corr[:5] + obesity_corr[-7:-1]
obesity_corr
['Life_Expectancy', 'Median_Household_Income', 'Some_College', 'Voter_Turnout', 'Broadband_Access', 'Premature_Age_Adjusted_Mortality', 'Frequent_Physical_Distress', 'Adult_Smoking', 'Poor_or_Fair_Health', 'Diabetes_Prevalence', 'Physical_Inactivity']
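The slice `[-7:-1]` above skips the last entry of the ascending sort, which is presumably Adult_Obesity itself, since a column has correlation 1 with itself. A sketch of a more explicit variant drops that column by label and then takes the two ends of the sorted series. The frame `toy` below holds made-up values chosen only to give one clearly negative, one near-zero, and one clearly positive correlation.

```python
import pandas as pd

# Made-up values; only the direction of each relationship matters here.
toy = pd.DataFrame({
    "Adult_Obesity": [0.30, 0.35, 0.40, 0.45],
    "Physical_Inactivity": [0.20, 0.25, 0.29, 0.36],  # rises with obesity
    "Life_Expectancy": [80.0, 78.0, 77.0, 74.0],      # falls with obesity
    "Voter_Turnout": [1.0, -1.0, -1.0, 1.0],          # unrelated pattern
})

# Correlation of every column with Adult_Obesity, excluding the column itself.
corr = toy.corr()["Adult_Obesity"].drop("Adult_Obesity").sort_values()

most_negative = corr.head(1)  # strongest negative correlation
most_positive = corr.tail(1)  # strongest positive correlation

print(most_negative.index[0], most_positive.index[0])
```

On the real data, `head(5)` and `tail(6)` would give the same counts as the slice used above.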
A few more plots
# Create each figure at the desired size before plotting
plt.figure(figsize=(6, 6))
sns.scatterplot(data=df, x="Median_Household_Income", y="Adult_Obesity")
plt.figure(figsize=(6, 6))
sns.scatterplot(data=df, x="Adult_Smoking", y="Adult_Obesity")
# Keep only the aggregate rows (County_FIPS_Code == 0): the states plus the US row
state_df = df[pd.to_numeric(df["County_FIPS_Code"]) == 0]
# Set the figure size before drawing so that it applies to this plot
plt.figure(figsize=(10, 9))
sns.barplot(data=state_df.sort_values(by=["Adult_Obesity"]), x="Adult_Obesity", y="State_Abbreviation")
plt.ylabel("State Names")
plt.xlabel("Adult obesity")
plt.yticks(fontsize=8)
plt.title("Obesity rates among adults in different US states", {'fontsize': 20})
I plan to explore why some states or counties have good health outcomes while others do not. Other questions include: "Which factors influence health outcomes the most?", "What affects obesity the most?", "Does the state or county location matter for health outcomes?", and "Why do certain demographics correlate with health results?"
Hopefully, I can find more data and variables to merge with this dataset, and with further analysis I can decide which variables to include in a model. The model would predict a health outcome, such as mortality or obesity, from easily available data.